stats: Repro & fix admin stats crash #20855
Conversation
Signed-off-by: Joshua Marantz <jmarantz@google.com>
mattklein123
left a comment
Thanks for fixing. Just one test question/comment.
/wait
```cpp
// Sets up a test using real threads to reproduce a race between deleting scopes
// and iterating over them.
class ThreadedTest : public testing::Test {
```
Instead of a stress-style test (or in addition to it), is it possible to do a deterministic thread-synchronizer-style test that also fails?
I'm looking at it. I definitely don't want it to be in lieu of the real-threads stress test. One thing I often find with synchronizers is that once you fix the race, you can't trigger the synchronizer without deadlock, but I'm iterating a bit.
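For context on the discussion above: a synchronizer-style test deterministically parks one thread at an instrumented point while the test thread performs the racing operation, then releases it. The sketch below is a generic illustration of that idea using standard condition variables (it is not Envoy's actual `Thread::ThreadSynchronizer` API; names here are invented for the example):

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Generic sync-point: the thread under test calls arriveAndWait() at the
// instrumented spot; the test observes the arrival, performs the racing
// mutation deterministically, then releases the parked thread.
class SyncPoint {
public:
  // Called by the thread under test at the instrumented point.
  void arriveAndWait() {
    std::unique_lock<std::mutex> lock(mutex_);
    arrived_ = true;
    cv_.notify_all();
    cv_.wait(lock, [this] { return released_; });
  }

  // Called by the test: blocks until the instrumented thread is parked.
  void waitForArrival() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return arrived_; });
  }

  // Called by the test: lets the parked thread continue.
  void release() {
    std::lock_guard<std::mutex> lock(mutex_);
    released_ = true;
    cv_.notify_all();
  }

private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool arrived_ = false;
  bool released_ = false;
};
```

Usage: the test spawns the thread, calls `waitForArrival()`, runs the conflicting operation (e.g. destroying a scope mid-iteration), then `release()` and `join()`. As the comment above notes, the caveat is that once the race is fixed the instrumented point may only be reachable while holding the relevant lock, so releasing the sync point can deadlock unless the instrumentation is placed carefully.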
Thanks -- having a hard time sussing out the clang-tidy error from the log; maybe an infra flake? I'll retest and if it still fails I'll look more carefully.

/retest

Retrying Azure Pipelines:

/retest

Retrying Azure Pipelines:
Commit Message: Rolls back the rollback PR #20835, re-enabling fast admin stats, now that #20855 has landed.

Additional Description: This brings the performance back to this state:

```
------------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
------------------------------------------------------------------
BM_AllCountersText            467 ms          466 ms            2
BM_UsedCountersText          37.0 ms         37.0 ms           19
BM_FilteredCountersText      1793 ms         1792 ms            1
BM_AllCountersJson            504 ms          504 ms            1
BM_UsedCountersJson          37.2 ms         37.2 ms           19
BM_FilteredCountersJson      1839 ms         1839 ms            1
```

So: around half a second of CPU burst for 1M text & json stats, rather than 1.8 seconds for text and 4.7 seconds for json. We still have a std::regex bottleneck when a filter is specified.

Risk Level: medium -- this re-enables code that previously had races, though #20855 reproduced and fixed them.
Testing: //test/...
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Commit Message: Fix Stats::Scope destruct/iterate race by holding a weak_ptr<Scope> in the scopes_ hash-table in ThreadLocalStore. Also adds a GUARDED_BY thread annotation to the scopes_ hash table and refactors a bit to ensure thread safety across all accesses. The thread-safety analysis needs more-than-usual annotation assistance for two reasons:

* the analysis system does not see that `ThreadLocalStoreImpl::lock_` and `ThreadLocalStoreImpl::ScopeImpl::parent_.lock_` are the same.
* in safeMakeStat call-sites, for code-sharing reasons, we need to take a reference to the guarded `central_cache_` entry before we decide whether we need to take the lock that protects it, so we need to disable analysis in that case. This way we can share the code that finds stats in the TLS-cache without taking locks.

A couple of helper methods, `centralCacheLockHeld()` and `centralCacheNoThreadAnalysis()`, were added to allow analysis to run with minimally scoped annotations.

A testcase was added which duplicates the race between looping over the stats for admin and creating/destroying scopes, using the fast /stats implementation that was disconnected in prod in envoyproxy#20835. This PR leaves the fast implementation disconnected, but fixes it. A separate PR will roll back envoyproxy#20835 after this lands.

Additional Description: The repro can spot the race by reverting the definition of `ThreadLocalStoreImpl::forEachScope` to its prior state, taking into account that `scopes_` is now a `map<ScopeImpl*, weak_ptr<ScopeImpl>>` rather than a `set<ScopeImpl>`:

```
for (auto iter : scopes_) {
  f_scope(*(iter.first));
}
```

Then test/server/admin:stats_handler_test will fail with:

```
[ RUN ] ThreadedTest.Threaded
terminate called after throwing an instance of 'std::bad_weak_ptr'
  what(): bad_weak_ptr
```

Risk Level: low -- scope iteration is being fixed here, but that doesn't happen in production yet.
Testing: //test/...
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a
Signed-off-by: Joshua Marantz <jmarantz@google.com>
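The weak_ptr-based fix described in the commit message can be sketched roughly as follows. This is a minimal model under stated assumptions: `Store` and `Scope` here are hypothetical stand-ins for illustration, not Envoy's actual `ThreadLocalStoreImpl`/`ScopeImpl`. The key idea is that the map is keyed on the raw pointer but stores only a weak_ptr, so iteration can `lock()` each entry and simply skip a scope that raced to destruction, instead of throwing `std::bad_weak_ptr`:

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <mutex>

// Hypothetical stand-in for a stats scope.
struct Scope {
  explicit Scope(int id) : id(id) {}
  int id;
};

// Hypothetical stand-in for the store: keys the map on the raw pointer,
// holds only weak ownership of each scope.
class Store {
public:
  std::shared_ptr<Scope> createScope(int id) {
    auto scope = std::make_shared<Scope>(id);
    std::lock_guard<std::mutex> guard(lock_);
    scopes_[scope.get()] = scope; // stored as weak_ptr; no ownership held
    return scope;
  }

  // Safe iteration: weak_ptr::lock() returns nullptr for an expired entry
  // rather than throwing, so a scope destroyed mid-race is skipped.
  void forEachScope(const std::function<void(Scope&)>& f) {
    std::lock_guard<std::mutex> guard(lock_);
    for (auto& entry : scopes_) {
      if (std::shared_ptr<Scope> strong = entry.second.lock()) {
        f(*strong);
      }
    }
  }

private:
  std::mutex lock_; // GUARDED_BY-style protection for scopes_
  std::map<Scope*, std::weak_ptr<Scope>> scopes_;
};
```

By contrast, the reverted `forEachScope` quoted above dereferenced the raw key directly, which blows up once the owning shared_ptr has been released on another thread.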